
File Handling and Memory Management

The following are frequently used in Python and a good understanding of each is required.

File Handling

File Handling in Python is relatively simple. You are able to read, write and edit files in various formats.

Let's look at a few simple examples.

Opening and writing to a text file

A simple file open and write operation:

If the file doesn't exist it will be created. If it does exist it will be overwritten. After the file is written we close it.

We use the w character to tell Python we want to write to the file.

a_file = open('a_file.txt', 'w')
a_file.write('My name is Michael Caine, not a lot of people know that')
a_file.close()

Appending to a file

To append to a file we use the a character.

a_file = open('a_file.txt', 'a')
a_file.write(' I\'m a famous movie star.')
a_file.close()

Appending to a file with some new lines

a_file = open('a_file.txt', 'a')
a_file.write('\n \n')
a_file.write("I starred in the original Italian Job movie. I have been making movies for over 55 years")
a_file.close()

Pretty simple, but it takes up a lot of lines. We can shorten the code, and take away the worry of closing the file manually, by using the with statement to open our file.

with in Python makes life easier by managing resources such as file streams for you.


with open('a_file.txt', 'a') as a_file:
    a_file.write('\n \n')
    a_file.write("I starred in the original Italian Job movie. I have been making movies for over 55 years")

We are using the with statement to open our file, and note that we do not call an explicit close function on the file. This is because with manages the file resource: as soon as we exit the with block, the file resource is automatically released and closed.
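Under the hood, with on a file behaves much like an explicit try/finally. A minimal sketch of the manual equivalent:

```python
# Roughly what the with statement does for us behind the scenes
f = open('a_file.txt', 'a')
try:
    f.write('\nAnother line')
finally:
    # This runs even if write raises, so the handle is always released
    f.close()

print(f.closed)  # True
```

The finally clause guarantees the close happens whatever goes wrong in between, which is exactly the safety with gives you for free.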

You can write to a file using the w character and append to a file using the a character. If you just want to read from a file you can just open it.
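For example, to read a file back, open it without a mode (the default is 'r' for read) and call read:

```python
# Write something first so there is a file to read
with open('a_file.txt', 'w') as f:
    f.write('My name is Michael Caine')

# The default mode is 'r' (read); read() returns the whole contents as a string
with open('a_file.txt') as f:
    contents = f.read()

print(contents)  # My name is Michael Caine
```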

In the following function we open a JSON file and load its contents into a variable data, i.e. read the contents into data.

Note: make sure you adjust the path so it correctly points to the file tutorials/starwars-data/films.json

Tip: you can find the directory you're running your code from by including the following in your code or by running it in the Python Console.

import os
print(os.getcwd())

def read_json_file(json_file):
    import json

    # Opening the JSON file
    with open(json_file) as jf:
        # json.load returns the JSON object as a dictionary
        data = json.load(jf)

    return data

data = read_json_file('tutorials/starwars-data/films.json')
print(data)

The result from running the above code should be a print of the film data.

Using the above read_json_file function, let's write the data to a csv (comma separated values) file using the Python csv package.

import csv

def csv_create(file, columns, data):
    """
    Generate a csv (comma separated values) file with columns and rows from the data.
    param: file: The path of the csv file to write to
    param: columns: The column names to include
    param: data: The data to write
    """
    # newline='' is recommended when writing csv files to avoid blank rows on Windows
    with open(file, 'w', encoding='UTF8', newline='') as f:

        try:
            writer = csv.DictWriter(f, fieldnames=columns, delimiter=',')
            writer.writeheader()

            # Iterate over the data, adding each column from the item to the row
            for item in data:
                row = {}
                for col in columns:
                    row[col] = item[col]
                writer.writerow(row)

        except Exception as e:
            # Report a generic error here - ideally this would be more specific...
            print(f"Problem writing to csv file --> {str(e)} status_code=500")
            raise

    print("Written CSV file")

data = read_json_file('tutorials/starwars-data/films.json')
columns = ('title', 'episode_id', 'director', 'producer', 'release_date')
csv_create("films.csv", columns, data['results'])

First, we read the data from the JSON file then create a list of columns we want from the data and finally call the csv_create function.

The csv_create function opens the file and uses the csv.DictWriter to specify the format. Each row of the csv file is written as a dictionary.

The code is straightforward enough. It loops over the data, and for each film matches each column with data from the film, adding it to the row dictionary. It then writes the row dictionary to the csv file.

Notice we are using utf8 encoding when we open the file. It's always a good idea with text just in case there are any non-ascii characters.
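To check the result you could read a csv back with csv.DictReader, the reading counterpart of DictWriter. A self-contained sketch (the sample rows and the sample.csv filename here are made up for illustration, not the real film data):

```python
import csv

# Write a tiny csv first (newline='' is recommended for csv files)
rows = [{'title': 'A New Hope', 'episode_id': '4'},
        {'title': 'The Empire Strikes Back', 'episode_id': '5'}]
with open('sample.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=('title', 'episode_id'))
    writer.writeheader()
    writer.writerows(rows)

# DictReader yields each row back as a dictionary of strings
with open('sample.csv', encoding='UTF8', newline='') as f:
    read_back = list(csv.DictReader(f))

print(read_back[0]['title'])  # A New Hope
```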

Python Memory Management

Under the hood Python isn't actually Python: the standard interpreter, CPython, is implemented in the 'C' programming language. In that 'C' code there is a Python Memory Manager that interacts with the operating system, handling the memory allocation and deallocation of all the objects and data in your Python code.

The following brief introduction is not meant to be a deep dive into Python memory management, but to provide you, the reader, with a good enough grasp of the basics to dig deeper without hesitation.

Memory Types

There are several types of memory when it comes to programming languages: code, heap, stack, and cached memory.

Code Code memory is where the instructions for your code to run are stored. As Python is interpreted, your source code is first compiled into bytecode, and code memory is allocated to hold that bytecode while the interpreter executes it.

Heap Heap memory is a non-static, dynamically allocated memory that is resizeable and non-contiguous. Heap memory is akin to global memory in as much as anything stored in the heap is available from anywhere in your code modules, as long as you have imported the relevant objects. Everything stored in heap memory can change, i.e. when you add, modify or remove data. For this reason heap memory can become fragmented: chunks of related data may not be in contiguous order, making access slower than it would be if they were. Heap memory may be swapped from RAM to hard disk if RAM becomes tight with lots of applications open simultaneously. The operating system tries to maximise the efficiency of RAM depending on what applications are running and their individual memory consumption.

Python has a private heap where it stores data structures and objects and uses a number of object-specific allocators that take care of allocating different types of objects and data structures to memory blocks. For example, an allocator for taking care of integers and another for dictionaries and lists etc. All these allocators are managed via the Python Memory Manager, and unlike when using a language like 'C', Python developers have no control over where data is stored and how much memory is allocated, Python literally takes care of everything related to memory management.

Stack Unlike heap memory, stack memory is both static and linear and is used to store local data as the code is running, such as a function's local variables. Anything stored on the stack is fixed in size and cannot be changed. Stack memory cannot be reallocated and follows a LIFO (last-in, first-out) paradigm. Imagine stacking some books on top of each other. When unstacking the books, the last one stacked is removed first, freeing up space. That is how stack memory is managed.
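The book analogy can be sketched directly with a Python list used as a stack:

```python
# Stack the books: append pushes onto the top of the stack
stack = []
for book in ['first book', 'second book', 'third book']:
    stack.append(book)

# Unstack: pop removes the last book added first (LIFO)
print(stack.pop())  # third book
print(stack.pop())  # second book
print(stack)        # ['first book']
```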

Python stores references to function/method calls and local variables on the stack. Once a function returns that stack memory is released. This ensures that the stack memory does not get bloated with function and object references no longer required.

Cache Cache memory is fast access memory that generally lives in RAM. RAM is extremely fast compared to disk. Frequently used data is often stored in cache memory. How much data is stored in cache depends on the size of the RAM in any individual computer. Most applications try to take advantage of cache memory as much as possible, but it is normally the operating system that manages the allocation of the cache. A number of databases use cache for storing data, automatically loading data from files on disk into cache. This is particularly relevant to a lot of NoSQL databases.

Once an application is terminated, or a computer shut down, anything in cache, unless saved to disk, is permanently lost. You cannot recover data from cache memory as you can from disk files.

Memory Efficiency

Python takes care of sharing common data items and doesn't store the same data at different memory addresses. The code below will make this clear.

Everything in Python has an id; in CPython this is the object's memory address. Normally we write memory addresses in hexadecimal notation, but in either case below, the values are the same for x and y.

x = 1
y = 1
print(id(x))
print(hex(id(x)))
print('-------------------------')
print(id(y))
print(hex(id(y)))

It's the same for strings or any other literal values. Even if the same values are in different functions. Of course as soon as a value changes so does the memory address.

def foo():
    x = "Richard"
    print("Richard x ", hex(id(x)))
    x = "Tayfun"
    print("Tayfun x ", hex(id(x)))

def bar():
    y = "Richard"
    print("Richard y ", hex(id(y)))
    y = "Tayfun"
    print("Tayfun y ", hex(id(y)))

foo()
bar()

Results

Richard x  0x10507f530
Tayfun x 0x10507f0b0
Richard y 0x10507f530
Tayfun y 0x10507f0b0

Obviously the result will not be the same on another machine, but what is important here is that 'Richard' shares the same address in both x and y, and likewise 'Tayfun'.

This principle does not hold for mutable data structures such as lists and dictionaries.

def foo():
    x = ["Richard"]
    print("Richard x ", hex(id(x)))
    x = ["Tayfun"]
    print("Tayfun x ", hex(id(x)))

def bar():
    y = ["Richard"]
    print("Richard y ", hex(id(y)))
    y = ["Tayfun"]
    print("Tayfun y ", hex(id(y)))

foo()
bar()

The results of the above are

Richard x  0x10aa4f840
Tayfun x 0x10aa61880
Richard y 0x10aa61880
Tayfun y 0x10aa4f840

Notice the pairs ('Richard x' and 'Tayfun y') and ('Tayfun x' and 'Richard y'). Those pairs share the same address. That is because the address has been reused based on availability.
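You can test whether two names point at the exact same object directly with the is operator. Note that small-integer sharing is a CPython implementation detail, not a language guarantee:

```python
# is tests whether two names refer to the exact same object (same id)
x = 1
y = 1
print(x is y)   # True: CPython caches small integers, so both names share one object

a = ["Richard"]
b = ["Richard"]
print(a is b)   # False: each list literal creates a brand new object
```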

But let's get back to immutable values. A word of caution: ("Richard") in the code below is not actually a tuple, because parentheses alone do not create one (a one-element tuple needs a trailing comma, as in ("Richard",)). These values are therefore still plain strings, and they behave exactly like the first example.

def foo():
    x = ("Richard")
    print("Richard x ", hex(id(x)))
    x = ("Tayfun")
    print("Tayfun x ", hex(id(x)))

def bar():
    y = ("Richard")
    print("Richard y ", hex(id(y)))
    y = ("Tayfun")
    print("Tayfun y ", hex(id(y)))

foo()
bar()

Results

Richard x  0x10507f530
Tayfun x 0x10507f0b0
Richard y 0x10507f530
Tayfun y 0x10507f0b0

Above, even though we have reassigned both x and y, we still have the same addresses for 'Richard' and 'Tayfun' in x and y respectively in each function. These values will stay in memory until those functions exit, and sometimes beyond.

Different objects in Python obviously have different memory costs, so in large applications when there is a lot of memory overhead it is worth paying a little bit of attention to memory usage.

The basics are these: mutable objects are more memory intensive. The following has several functions, each containing the same literal data in different data structures.

import sys

def one():
    # A List
    x = ["Richard", 1, "Tayfun", 1, "Pablo", 1]
    print(x, sys.getsizeof(x))
    return x


def two():
    # A Tuple
    x = ("Richard", 1, "Tayfun", 1, "Pablo", 1)
    print(x, sys.getsizeof(x))
    return x


def three():
    # A dictionary
    x = {"Richard": 1, "Tayfun": 1, "Pablo": 1}
    print(x, sys.getsizeof(x))
    return x


def four():
    # A Set
    x = set(["Richard", 1, "Tayfun", 1, "Pablo", 1])
    print(x, sys.getsizeof(x))
    return x

a = one()
b = two()
c = three()
d = four()

Results

['Richard', 1, 'Tayfun', 1, 'Pablo', 1] 152
('Richard', 1, 'Tayfun', 1, 'Pablo', 1) 88
{'Richard': 1, 'Tayfun': 1, 'Pablo': 1} 232
{1, 'Tayfun', 'Richard', 'Pablo'} 216

Taking a look at the results above, you can clearly see which is the slimmest memory eater. With 88 bytes the clear winner is the tuple. However, the larger the data set, the smaller the difference becomes.

import sys

def list_vs_tuple():
    # A list of one million integers, and the same data as a tuple
    x = [1] * (10 ** 6)
    y = tuple(x)
    print("list ", sys.getsizeof(x))
    print("tuple ", sys.getsizeof(y))

list_vs_tuple()

Running the above we get the following

list  8000056
tuple 8000040

The big 8000000 on each is the data, the pointer storage for all the 1's generated. The remaining 56 and 40 bytes are the overhead of the data structure itself. A difference of 16 bytes. Not much!
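One caveat worth knowing here: sys.getsizeof is shallow. It measures only the container itself, not the objects the container references:

```python
import sys

# A list holding a single 1000-character string
x = ["a" * 1000]

# Small: just the list's own overhead plus one pointer
print(sys.getsizeof(x))
# The string it references is far larger
print(sys.getsizeof(x[0]))
```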

Let's take a look at some common memory storage measurements.

import sys
from decimal import Decimal

print(sys.getsizeof(2))
print(sys.getsizeof(2000000))
print(sys.getsizeof(2500.1010))
print(sys.getsizeof(Decimal(25.10)))

Running the above code in Python 3.10 will produce the following results

28
28
24
104

Interestingly, both the number 2 and 2000000 have the same number of bytes, i.e. 28. On 64-bit CPython 3.10 an integer has a base size of 24 bytes, and each 30-bit internal 'digit' of the value adds 4 more. Both 2 and 2000000 fit in a single digit (any value up to 2**30 - 1), so both occupy 28 bytes.

Try adding another zero to the 2000000 and see how many bytes we get. Do this until the byte size changes.
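If you'd rather let a loop do the experiment, here is a small sketch (the sizes assume 64-bit CPython; other builds may differ):

```python
import sys

# Keep adding zeros until the reported byte size grows
n = 2000000
while sys.getsizeof(n) == sys.getsizeof(2):
    n *= 10

# On 64-bit CPython the size jumps once the value no longer fits
# in a single 30-bit internal digit (i.e. past 2**30 - 1)
print(n, sys.getsizeof(n))
```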

The last two printouts above are equally interesting. Notice that the floating point number, the third print statement, is smaller than an integer. A CPython float is a 16-byte object header plus 8 bytes for the double-precision value, giving a fixed 24 bytes regardless of the value.

But now look at the last printout. That uses the Decimal class from the built-in decimal module to create a decimal number. Look at the difference: it's over four times larger. That's a big difference. The issue with floating point numbers is that they are quite often approximations and not exact. Some integers and decimals are precisely represented by a floating point number, but most of the time they are not.

The following example displays such a case

print(sum(1.5 for _ in range(9)))
print(sum(1.1 for _ in range(9)))

Running the above two statements will give you the following results

13.5
9.899999999999999

Now run it with the 1.1 replaced by Decimal('1.1') (note the quotes: constructing from the string keeps the value exact) and see the difference for yourself.
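A sketch of what you should see:

```python
from decimal import Decimal

# Construct from the string '1.1' so the value is exactly 1.1;
# Decimal(1.1) would faithfully copy the float's approximation error
total = sum(Decimal('1.1') for _ in range(9))
print(total)         # 9.9
print(Decimal(1.1))  # a long approximation, not exactly 1.1
```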

To understand the complexities of the differences between using floating points and decimals see Floating Point Arithmetic: Issues and Limitations

Let's look at strings in memory

import sys

print("empty string ''", sys.getsizeof(''))
print("Richard", sys.getsizeof("Richard"))
print("Tayfun", sys.getsizeof("Tayfun"))
print("Richard Tayfun", sys.getsizeof("Richard Tayfun"))
x = "Richard " + "Tayfun"
print('"Richard" + "Tayfun"', sys.getsizeof(x))

Results

empty string '' 49
Richard 56
Tayfun 55
Richard Tayfun 63
"Richard" + "Tayfun" 63

A brief look at the results above shows clearly that an empty string has a base of 49 bytes. "Richard", 7 characters, brings that number to 56 bytes; "Tayfun", 6 characters, to 55 bytes... and both together (14 characters including the space) 63 bytes. For each ASCII character we have one extra byte of memory taken up.
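The one-byte-per-character figure holds for ASCII text. CPython stores strings compactly, so non-ASCII characters force a wider internal representation and a higher per-character cost (exact sizes vary by build and version):

```python
import sys

# ASCII text is stored at one byte per character
print(sys.getsizeof('e'))
# A non-ASCII character forces a wider internal representation,
# so the per-character cost goes up
print(sys.getsizeof('é'))
print(sys.getsizeof('日'))
```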

Let's see how it works with lists, tuples, dictionaries and sets

import sys

print("[] ", sys.getsizeof([]))
x = ["Richard", "Tayfun"]
print('["Richard", "Tayfun"]', sys.getsizeof(x))

print('()', sys.getsizeof(()))
x = ("Richard", "Tayfun")
print('("Richard", "Tayfun") ', sys.getsizeof(x))

print("{} ", sys.getsizeof({}))
x = {"Richard": 1, "Tayfun": 2}
print('{"Richard": 1, "Tayfun": 2} ', sys.getsizeof(x))

print("set() ", sys.getsizeof(set()))
x = set(["Richard", "Tayfun"])
print('set(["Richard", "Tayfun"])', sys.getsizeof(x))

Run the above in the console and check the results; they're interesting to say the least. Specifically, look at the difference in the empty structures' sizes. It gives plenty of food for thought on how you should group and save your data, individually or in data structures, and which of those is best suited to the job.

Garbage Collector

Python's Memory Manager doesn't always return memory to the operating system; in some cases it may hold onto memory for future use. Python has its own garbage collector which periodically cleans up unneeded data from memory. The garbage collector uses references to objects to determine whether memory can be cleared: if an object in memory has zero references, it is cleared.

Python's backend 'C' code does a good job of automatically dumping stuff out of memory that is no longer referenced, deleted data...etc. As stated above, garbage collection is performed periodically by the Python Memory Manager. However, it is possible to manually call the Python garbage collector, but unless you are certain of what you are doing the rule of thumb is to leave it to Python.

When investigating memory issues such as memory leaks, it can be useful to count how many times an object is referenced from our code. Look at the following example.

import sys
import gc

x = "Richard"
y = "Tayfun"

print("-----------------------")
print(f"Both {x} and {y} are software developers")
print(f"But {x} is older and wiser")
print("-----------------------")
print("x refs ", sys.getrefcount(x))

z = [x] # Add x to a list
print("-----------------------")
print("x refs after adding to list z", sys.getrefcount(x))
print("y refs ", sys.getrefcount(y))

Running the above will display the x and y reference counts.

However, there are some caveats to counting references. Python has a lot going on under the hood, and although the core Python code is mostly C, a number of modules are written in Python. For example, importing sys doesn't just load a single module. To run sys there are a number of module dependencies.

import sys
print(len(sys.modules))

Run that and see how many modules there are.

Python shares things like small integers across various uses, and these integers remain in memory. So even though the following code contains just one reference, the total number of references can be numerous.

import sys
my_int = 1
print(sys.getrefcount(my_int))

Run the above, and you'll get the picture...

Occasionally there may be reasons to manually garbage collect. There is a simple way to kill off variables, and that is the del statement. Once you call del on a variable, the variable no longer exists.

import sys
my_int = 1
del my_int
print(sys.getrefcount(my_int))

The above will throw a NameError.

Python's built-in garbage collection functions can be imported via the gc module. This module has several methods, see Garbage Collector interface

Aside from reference counting, Python's garbage collection is generational. Generational garbage collection has two main attributes:

  1. Generations There are three generations of objects for garbage collection, best described thus. Creation of a new object, placed in memory, is tracked by the underlying Python Memory Manager's garbage collection utilities. This new object starts out in the first generation. When it is deleted or has no references, the garbage collector attempts to remove it from memory. If for some reason, and we'll explore the most common soon, it cannot be deleted, it is placed in the second generation; again, if it cannot be deleted from memory whilst in that generation, it is placed in the third generation.

  2. Thresholds The garbage collection thresholds represent the number of objects that can be stored in each generation before garbage collection is triggered.

You can see the default threshold for each generation, and the actual number of objects currently in each generation, by running

import gc
print(gc.get_threshold())
print(gc.get_count())

You can change the threshold values if needed for a memory intensive operation, increasing or decreasing them as required.

import gc
gc.set_threshold(1000, 20, 20)
print(gc.get_threshold())

To see how the objects accumulate in the generations, run the following example

import gc
gc.collect()
print(gc.get_count())


def foo():
    x = "foo1"
    y = "foo2"


def bar():
    x = "bar1"
    y = "bar2"


foo()
print(gc.get_count())
bar()
print(gc.get_count())
del foo
print(gc.get_count())

You can see how the count changed on the last print statement after we deleted foo

Now let's look at a typical example of why an object may not be deleted from memory and moved up a generation

import gc

gc.collect()
print(gc.get_count())


class SomeClass(object):
    pass

someobject = SomeClass()
print(gc.get_count())
someobject.obj = someobject
print(gc.get_count())
del someobject
print(gc.get_count())
gc.collect()
print(gc.get_count())

So what's happening in the above code? First we do a quick collection to get rid of anything hanging around. We then do a count to see how many objects are in each generation. After that we create a new class, generate an instance of that class, then assign the instance to an attribute of itself, so it now contains a reference to itself. We then do another count of generation objects, delete the instance, and count again. Even though we have deleted the object, and it is no longer accessible from the code, the generation count has not changed.

To confirm that the count changes after delete without the reference to itself, comment out the line someobject.obj = someobject and run it again.

This problem is known as a reference cycle. Reference counting alone cannot reclaim an object that references itself. Here is where manual garbage collection comes into play.

After the count we call gc.collect again, and this forces the garbage collection to run, after which the count will have changed.
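Reference cycles can also involve more than one object: two objects that refer to each other are just as invisible to reference counting alone. A sketch (Node is a made-up class purely for illustration):

```python
import gc

class Node:
    pass

a = Node()
b = Node()
a.partner = b
b.partner = a      # a and b now reference each other

gc.collect()       # clear out any pre-existing garbage first
del a, b           # unreachable from our code, but refcounts are still non-zero
collected = gc.collect()
print(collected)   # the cyclic collector found and freed the pair
```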

That's a wrap on the Python Tutorial. We hope you found it useful as a guide and wish you well on your journey.